System and method for interactive multi-view video

ABSTRACT

Interactive multi-view video presents a new type of video capture system. Many video cameras are allocated to capture an event from various related locations and directions. The captured videos are compressed in control PCs and are sent to a server in real-time. The compressed video can also be transcoded through an off-line compression approach to further reduce the data amount. Users can subscribe to a new type of service that allows users to connect to the servers and receive multi-view videos interactively.

BACKGROUND

1. Technical Field

This invention is directed toward a system and method for interactive multi-view video which includes a new type of video capture system and method, a new type of video format, a new type of both on-line and off-line video compression, and a new type of video services.

2. Background Art

The current popularly used video form is so-called single-view video. It consists of one video clip that is captured from one video camera or multiple video clips that are concatenated using sequential time periods. For any time instance, there is only one view of an event. This kind of video form is widely used in video streaming, broadcasting and communication in televisions (TVs), personal computers (PCs) and other devices.

When reviewing conventional multimedia services (like traditional TV, video-on-demand, video streaming, digital video disc (DVD), and so on), there exist several limitations. For example, in conventional multimedia services, there is only one video stream for an event at any instance in time. Additionally, in conventional multimedia services, the viewing direction at any time instance is selected by program editors. Users are in a passive position, unable to change the camera angle or view point. Furthermore, they can only watch what has been recorded and provided to them and do not have the ability to select the viewing angles.

As an extension of the traditional single-view video, EyeVision [1] is a sports broadcasting system co-developed by Carnegie Mellon University's computer vision professor Takeo Kanade. EyeVision employed 30 camcorders to shoot the game at Super Bowl 2001. The videos captured from the 30 camcorders were all input to a video routing switcher and an edited video was broadcast to TV viewers. The EyeVision system, however, only provides users with one edited video without the ability for the user to select viewing directions and exercise camera control. It also only serves a TV audience and is not available in other multi-media formats.

In addition to EyeVision, another multi-media device, a 3D video recorder, was designed for recording and playing free-viewpoint video [3]. It first captures 2D video and then extracts the foreground from the background. Source coding is applied to create 3D foreground objects (e.g., a human). However, like EyeVision, the 3D recorder does not allow the users to control the cameras. Additionally, the processing employed by the 3D video recorder necessitates the classification of the foreground from the background which requires substantial computational assets.

With the increasing demand for multi-view video, standardization efforts have occurred recently [4][5]. The MPEG community has been working since December 2001 on the exploration of 3DAV (3D Audio-Visual) technology. Many very diverse applications and technologies have been discussed in relation to the term 3D video. None of these applications focused on interactivity, in the sense that the user has the possibility to choose his viewpoint and/or direction within dynamic real audio-visual scenes, or within dynamic scenes that include 3D objects that are reconstructed from real captured imagery. With regard to the application scenarios it has been found that the multi-view video is the most challenging scenario with most incomplete, inefficient and unavailable elements. This area requires the most standardization efforts in the near future. Furthermore, no standardization efforts have dealt with interactivity.

Therefore, what is needed is a system and method for efficiently capturing and viewing video that has many streams of video at a given instance and that allows users to participate in viewing direction selection and camera control. This system and method should have a high degree of accuracy in its calibration and provide for efficient compression techniques. Furthermore, these compression techniques should facilitate the exhibition of various viewing experiences. Optimally the hardware should also be relatively inexpensive. Such a system should allow the viewing audience to participate in various viewing experiences and provide for special effects. Additionally, this system and method should be computationally efficient and should be robust to handling large amounts of image and audio data, as well as user interactions.

It is noted that in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.

SUMMARY

As the use of cameras becomes more popular, computer processing power becomes stronger and network bandwidth becomes broader, users desire to leverage these advantages to pursue a richer multi-media experience. Moreover, it is highly desirable to capture comprehensively some important events, such as surgical and sports championship events, from different view points and angles.

The natural extension to the previously discussed single-view video form is the multi-view video form of the present invention. In multi-view video multiple videos of an event or event space are simultaneously captured at different view points and angles. These multi-view videos are compressed, transmitted, stored and finally delivered to users. One of the important features of the multi-view video of the invention is that users can control the capturing of videos and select the viewing of events from different directions.

The new type of video capture system consists of video cameras, control PCs, servers, network components and clients. Audio components can also be used to capture any associated audio. Multiple cameras, in one embodiment tens or hundreds of video cameras, are allocated to capturing events in an event place in a master-slave configuration. These cameras are controlled by one or more control PCs. Events in the event space are simultaneously captured by the cameras from various view points and directions. Then, these captured videos are compressed in the control PCs and sent to one or more servers in real-time. The compressed videos can then be either delivered to the end users in real-time or be further compressed by exploiting the spatial and temporal correlations among them.

Interactive multi-view video is a natural extension to the current single-view video that is popularly used in media streaming, broadcasting, and communication. Interactive multi-view video meets the trends of technology developments and customer demands. Interactive multi-view video may have a strong impact on various media applications like media players, messaging systems and meeting systems.

The interactive multi-view video system of the invention has many advantages. It provides users with the selection of video streams and control of the cameras which allow users to select viewing directions at any time instance. No classification of foreground and background objects is required for this interactive multi-view video system of the invention unlike the prior systems. Additionally, more efficient coding is adopted by the interactive multi-view video system than prior video systems, with a richer capability that facilitates the representation of special effects.

In addition to the just described benefits, other advantages of the present invention will become apparent from the detailed description which follows hereinafter when taken in conjunction with the drawing figures which accompany it.

DESCRIPTION OF THE DRAWINGS

The specific features, aspects, and advantages of the invention will become better understood with regard to the following description, appended claims, and accompanying drawings where:

FIG. 1 is a diagram depicting a general purpose computing device constituting an exemplary system for implementing the invention.

FIG. 2 is a simplified block diagram of the interactive multi-view video system according to the invention.

FIG. 3 is a simplified flow diagram of the overall calibration procedure employed in the interactive multi-view video system of the invention.

FIG. 4A is an image of an exemplary calibration pattern used in one embodiment of the system and method according to the invention.

FIG. 4B is a flow diagram of the pattern-based calibration employed in the interactive multi-view video system of the invention.

FIG. 5 is a flow diagram of the pattern-free calibration employed in the interactive multi-view video system of the invention.

FIG. 6A is a diagram of the video index table used in the interactive multi-view video system of the invention.

FIG. 6B is a diagram of the audio index table used in the interactive multi-view video system of the invention.

FIG. 7 is a flow diagram depicting the on-line compression scheme for one camera of one embodiment of the invention.

FIG. 8 is a flow diagram depicting the intra-mode encoding of one embodiment of the invention.

FIG. 9 is a flow diagram depicting the inter-mode encoding of one embodiment of the invention.

FIG. 10 is a flow diagram depicting the static mode encoding of one embodiment of the invention.

FIGS. 11A, 11B and 11C are schematics of the encoding architectures, inter-mode, intra-mode and static mode, respectively, of one embodiment of the invention.

FIGS. 12A and 12B provide a flow diagram depicting the encoding logic for encoding the bit streams of multiple cameras.

FIG. 13 is a flow diagram depicting the off-line compression scheme of one embodiment of the invention.

FIG. 14 is the architecture of the off-line compression system of one embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In the following description of the preferred embodiments of the present invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention.

1.0 Exemplary Operating Environment

FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies. A user may enter commands and information into the computer 110 through input devices such as a keyboard 162 and pointing device 161, commonly referred to as a mouse, trackball or touch pad. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus 121, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 195. Of particular significance to the present invention, a camera 192 (such as a digital/electronic still or video camera, or film/photographic scanner) capable of capturing a sequence of images 193 can also be included as an input device to the personal computer 110. Further, while just one camera is depicted, multiple cameras could be included as an input device to the personal computer 110. The images 193 from the one or more cameras are input into the computer 110 via an appropriate camera interface 194. This interface 194 is connected to the system bus 121, thereby allowing the images to be routed to and stored in the RAM 132, or one of the other data storage devices associated with the computer 110. However, it is noted that image data can be input into the computer 110 from any of the aforementioned computer-readable media as well, without requiring the use of the camera 192. An audio recorder 198 can also be connected to the computer via an audio interface device 199 for the purpose of capturing audio data.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on memory device 181. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

The exemplary operating environment having now been discussed, the remaining parts of this description section will be devoted to a description of the program modules embodying the invention.

2.0 A System and Method for Interactive Multi-View Video.

The system and method according to the invention is described in detail in the following sections. The system of interactive multi-view video consists of three primary parts: a capture component, a server component and a client component.

2.1. Capture Component.

The capture component 202 of the interactive multi-view camera system of the invention comprises cameras (for example, video cameras), lenses, pan-tilt heads, control PCs and synchronization units. As shown in FIG. 2, in one embodiment of the invention, two video cameras 204a, 204b, each having its own pan-tilt head 206a, 206b and lens (e.g., a zoom lens) 208a, 208b, are connected to one control PC 210, each via a respective 1394 port (not shown). Each camera has its own ID number. The control PC 210 can change the view point and angle of the camera by controlling the pan-tilt head 206 and lens 208 via, for example, an RS232 interface. A synchronization unit 214 is linked to one or more control PCs 210, preferably through their 1394 ports or other suitable means. The capture component of the system can also include audio recording equipment 209 which records any audio at certain positions.

The synchronization unit 214 is used to make all of the cameras trigger and shoot at the same instant in time. Therefore, the control PCs can grab videos from the cameras simultaneously. From all of these cameras, one is selected to be a master camera, while the rest are called slave cameras. The master camera is controlled by a camera man, while the slave cameras can be driven to point to the same interesting point as the master camera. This is realized by a so-called master-slave tracking process. Typically the camera man is a person. In some cases, however, the master camera can be controlled by an object tracking algorithm without commands from a real camera man.

Control commands are input in the control PC of the master camera. The pan-tilt parameters are calculated and transmitted to other control PCs to drive all the slave cameras. Captured videos are received, compressed and transmitted to servers by the control PC. In one embodiment of the invention, each video is captured at a size of 640×480 and a frame rate of 30 frames per second. The detailed on-line compression procedure used in one embodiment of the invention will be presented in Section 3.1.
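To give a sense of why each control PC compresses video before sending it to the servers, the uncompressed data rate of a single camera can be worked out from the stated frame size and frame rate. The following is a minimal sketch; the pixel format (12 bits per pixel, as in YUV 4:2:0 sampling) is an assumption made only for illustration and is not specified in the text, and the function name is hypothetical.

```python
def raw_video_rate(width, height, fps, bits_per_pixel):
    """Return (bytes_per_frame, bits_per_second) for uncompressed video."""
    bits_per_frame = width * height * bits_per_pixel
    return bits_per_frame // 8, bits_per_frame * fps


# 640x480 at 30 frames per second comes from the text. The value of
# 12 bits per pixel (YUV 4:2:0) is an assumption for this illustration.
frame_bytes, bit_rate = raw_video_rate(640, 480, 30, 12)
print(frame_bytes, "bytes per frame")
print(bit_rate / 1e6, "Mbit/s per camera before compression")
```

Under that assumption, one camera produces roughly 110 Mbit/s of raw data, and the figure scales linearly with the number of cameras.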

2.1.1 Camera Calibration.

Before the master-slave tracking, the cameras should be calibrated. A calibration process that determines the intrinsic parameters, extrinsic parameters, and hand-eye relationship is employed in the multi-view video system of the invention. A general flow chart of this process is shown in FIG. 3. First, the intrinsic camera parameters are computed (process action 302), followed by the determination of the extrinsic camera parameters (process action 304). Then, the hand-eye parameters are determined (process action 306). Finally, the determined intrinsic, extrinsic and hand-eye parameters are used to calibrate the cameras by adjusting the extrinsic parameters of all cameras in a common coordinate system. Given all of these parameters and the pan-tilt parameters of the master camera, the pan-tilt parameters of the slave cameras which make the slave cameras point to the same interesting point as the master camera can be efficiently computed and adjusted.
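Once all cameras share a common coordinate system, aiming a slave camera reduces to a geometric computation. The sketch below illustrates the idea in a deliberately simplified form and is not the procedure of the invention: it assumes the interesting point is already known as a 3D position, that the z axis points up, that pan is measured about the z axis from the +x axis, and that each camera's optical center coincides with the rotation center of its pan-tilt head (so the hand-eye offset is ignored). The function name and the head positions are hypothetical.

```python
import math


def pan_tilt_to_target(head_position, target):
    """Pan and tilt angles (degrees) that aim a pan-tilt head at a 3D point.

    Assumed conventions (illustrative only): z is up, pan is the angle
    about the z axis measured from +x, tilt is the elevation above the
    horizontal plane, and the hand-eye offset is ignored.
    """
    dx = target[0] - head_position[0]
    dy = target[1] - head_position[1]
    dz = target[2] - head_position[2]
    pan = math.degrees(math.atan2(dy, dx))
    tilt = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return pan, tilt


# Hypothetical slave head positions in the common coordinate system (meters).
slave_heads = {"slave_1": (0.0, 0.0, 2.0), "slave_2": (10.0, 0.0, 2.0)}
interest_point = (5.0, 5.0, 0.0)  # point on the ground the master looks at

angles = {name: pan_tilt_to_target(pos, interest_point)
          for name, pos in slave_heads.items()}
for name, (pan, tilt) in angles.items():
    print(name, round(pan, 2), round(tilt, 2))
```

In a real system the calibrated hand-eye relationship would be applied so that the optical axis, rather than the head's rotation center, passes through the interesting point.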

The intrinsic parameters are defined using the basic pin-hole camera model. They are only dependent on the intrinsic structure of the camera. They include the ratio of the focal length to the width of one image pixel, the ratio of the focal length to the height of one image pixel, the x coordinate of the principal point and the y coordinate of the principal point. The extrinsic parameters are not dependent on the intrinsic structure of the camera. They define the location and orientation of the camera reference frame with respect to a known world reference frame. They typically include a rotation matrix and a 3D translation vector. The hand-eye relationship parameters include the location and orientation of each camera with respect to its pan-tilt head.
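How the intrinsic and extrinsic parameters combine can be illustrated with the standard pin-hole projection. In the sketch below, fx and fy stand for the two focal-length ratios and (cx, cy) for the principal point; these names, the function name and the numeric values are assumptions chosen for illustration, and lens distortion is not modeled.

```python
def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates with a pin-hole model.

    rotation (3x3 nested list) and translation (length 3) are the extrinsic
    parameters mapping world coordinates into the camera reference frame.
    fx, fy, cx, cy are the intrinsic parameters.
    """
    # Extrinsic step: world frame -> camera frame.
    xc = [sum(rotation[i][k] * point_world[k] for k in range(3)) + translation[i]
          for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Intrinsic step: perspective division, then scale and shift to pixels.
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = project_point((0.1, -0.2, 2.0), identity, (0.0, 0.0, 0.0),
                      fx=800.0, fy=800.0, cx=320.0, cy=240.0)
print(pixel)
```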

Two calibration methods, pattern-based calibration and pattern-free calibration, are adopted in the multi-view interactive video system and method of the invention. The pattern-based calibration is realized by using a large calibration pattern, preferably placed at the ground plane or other suitable reference plane, while the pattern-free calibration exploits the information brought by the ground plane. These two methods are described in more detail below.

2.1.2 Pattern-Based Calibration.

In one embodiment of the invention, a plane-based algorithm [2] is used to calibrate the intrinsic parameters due to its accuracy and simplicity. Such calibration needs to be performed only once every few weeks because the intrinsic parameters vary only slightly. The extrinsic parameters of all cameras are calibrated in a common world coordinate system, preferably in the coordinate system of the pattern plane. Then the hand-eye relationship of each camera is also calibrated from its extrinsic parameters at no less than three pan-tilt positions.

The pattern-based method uses images of a planar pattern with precisely known geometry. To make the pattern-based calibration automatic, in one embodiment of the invention a special calibration pattern was designed, shown in FIG. 4A, which uses three colors (red, green and blue) to encode the positions of all corner points. An automatic procedure was designed to capture the image of the pattern by the cameras undergoing different pan-tilt motions, and then to detect the corners of the pattern along with the color-encoded positions.

A simplified flow diagram of the pattern-based calibration is shown in FIG. 4B. The pattern is placed on the ground or other suitable reference frame with its corners and possibly other reference points placed at known locations (process action 402). All cameras then capture an image of the calibration pattern (process action 404). By finding and using the correspondence between feature points extracted from the images and the reference pattern points whose coordinates are known, the extrinsic camera parameters can be precisely estimated (process action 406) using conventional techniques. In order to obtain an accurate calibration, the reference pattern should be precisely manufactured, and it should occupy the major part of the image used for calibration. Furthermore, in a large scale system, setting up a large reference pattern with great accuracy is a non-trivial task that requires special equipment. In order to avoid this inconvenience, a pattern-free calibration method was developed and is described below.

2.1.3 Pattern Free Calibration.

2.1.3.1 Overview of the Pattern Free Calibration Procedure.

In one embodiment of the invention, an automatic pattern-free calibration tool is employed. In contrast with the pattern-based method, which uses the correspondences between image points and pattern points to determine the cameras' extrinsic parameters, the pattern-free calibration method is based on the correspondences between image points from different cameras. FIG. 5 provides a general flow diagram of the pattern-free calibration procedure of the interactive multi-view video system of the invention. First, as shown in process action 502, one extracts the feature points in each image of both the master and slave cameras. Using these feature points, a set of inter-image homographies is estimated that maps the features in each image to the image of the master camera (process action 504). Then, a linear solution of the extrinsic parameters can be obtained based on these homographies, preferably using a Singular Value Decomposition (SVD) operation, as shown in process actions 506 and 508. SVD is a classical mathematical operation which can be used to find the eigenvalues and eigenvectors for a matrix. In the method used in the invention, SVD is used to find the eigenvalues and their corresponding eigenvectors for the product matrix of the homography of the feature points and its transpose. Based on these obtained eigen components, the cameras' extrinsic parameters can be estimated as a least-squares solution to a set of linear equations. After this, as shown in process action 510, a bundle adjustment of the extrinsic camera parameters is applied to refine them by minimizing the sum of re-projection errors of all feature correspondences. Using the estimated extrinsic parameters, one can project the features in the master image (e.g., taken by the master camera) onto slave images (e.g., taken by the slave cameras). The term “re-projection errors” refers to the errors between these features projected onto the slave images and the corresponding features detected in those slave images. Using the sum of the re-projection errors is a conventional way of evaluating the accuracy of calibrated parameters. In one embodiment of the invention, the estimated parameters are refined by minimizing the sum of re-projection errors using a Levenberg-Marquardt (LM) method.
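The quantity minimized during refinement can be written down compactly. The sketch below computes a sum of squared re-projection errors for a set of feature correspondences; the use of squared distances (the usual objective for a Levenberg-Marquardt solver) is an assumption, and the function name and the toy mapping are hypothetical stand-ins for a projection built from real calibrated parameters.

```python
def sum_squared_reprojection_error(master_points, slave_points, project_to_slave):
    """Sum of squared pixel distances between projected and observed features.

    master_points / slave_points: lists of matching (x, y) pixel positions.
    project_to_slave: callable mapping a master pixel to a predicted slave
    pixel using the current parameter estimates.
    """
    total = 0.0
    for p_master, p_slave in zip(master_points, slave_points):
        u, v = project_to_slave(p_master)
        du, dv = u - p_slave[0], v - p_slave[1]
        total += du * du + dv * dv
    return total


# Toy mapping used only to exercise the function: a 10-pixel horizontal shift.
def shift_right(p):
    return (p[0] + 10.0, p[1])


master = [(0.0, 0.0), (5.0, 5.0)]
slave = [(10.0, 0.0), (15.0, 6.0)]  # second observation is off by one pixel
error = sum_squared_reprojection_error(master, slave, shift_right)
print(error)
```

A refinement step would adjust the parameters inside the projection function so that this total decreases.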

2.1.3.2 Homography Estimation.

The pattern-free calibration technique of the invention can be more specifically described as follows. In most environments, there is a dominant plane, typically the ground plane. When multiple cameras are set up in such a scenario, each of the cameras forms an image of a common plane such as the dominant plane. For example, two images from two cameras (one the master camera, and the other a slave) with different positions looking at the ground plane are linked by a 3×3 homography H defined by

$H \cong A_{2}\left( R + \frac{t n^{T}}{d} \right) A_{1}^{-1}$

where A₁ and A₂ are the intrinsic matrices of the master and slave cameras, respectively. The symbol ≅ denotes equality up to a nonzero scale factor, because the homography can only be estimated up to a scale. R and t are the extrinsic parameters of the slave camera (rotation and translation) in the reference coordinate frame of the master, n is the unit normal vector of the ground plane, and d is the distance from the master camera's optical center to that plane.
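The relation above can be checked numerically. The sketch below builds H from assumed intrinsic and extrinsic values and verifies that it maps the master-image pixel of a point on the plane to the slave-image pixel of the same point. All numeric values and helper names are hypothetical, and the sketch assumes the convention that a point X on the plane satisfies nᵀX = d in the master frame and that the slave frame is obtained as RX + t.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, x):
    return [sum(a[i][k] * x[k] for k in range(3)) for i in range(3)]


def intrinsic(fx, fy, cx, cy):
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def intrinsic_inverse(fx, fy, cx, cy):
    # Closed-form inverse of the upper-triangular intrinsic matrix.
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def plane_homography(a1_params, a2_params, rotation, t, n, d):
    """H = A2 (R + t n^T / d) A1^-1, defined up to scale."""
    core = [[rotation[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]
    return matmul(matmul(intrinsic(*a2_params), core), intrinsic_inverse(*a1_params))


def to_pixel(x):
    return (x[0] / x[2], x[1] / x[2])


# Hypothetical setup: slave rotated 0.1 rad about the y axis and translated.
a1, a2 = (800.0, 800.0, 320.0, 240.0), (820.0, 820.0, 320.0, 240.0)
c, s = math.cos(0.1), math.sin(0.1)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t = [-0.5, 0.0, 0.1]
n, d = [0.0, 0.0, 1.0], 4.0          # plane n^T X = d in the master frame
X = [0.3, -0.2, 4.0]                 # a point on that plane

master_pixel = to_pixel(matvec(intrinsic(*a1), X))
X_slave = [matvec(R, X)[i] + t[i] for i in range(3)]
slave_pixel = to_pixel(matvec(intrinsic(*a2), X_slave))

H = plane_homography(a1, a2, R, t, n, d)
mapped = to_pixel(matvec(H, [master_pixel[0], master_pixel[1], 1.0]))
print(mapped, slave_pixel)
```

The two printed pixel positions agree, which is the property the homography estimation step relies on: any feature lying on the common plane is transferred between the two views by the same 3×3 matrix.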

Given four or more point correspondences between the two images (no three of which are collinear), there are various conventional techniques by which a homography can be estimated. For example, the homography can be estimated by a basic computer vision algorithm named Direct Linear Transform (DLT). One embodiment of the invention employs a Random Sample Consensus (RANSAC) technique to estimate the homographies. This method consists of five steps:

1. Detecting feature points. In one embodiment a corner detection operator is used to detect features from two images.

2. Obtaining a hypothesis of corresponding feature sets by exploiting the inter-image similarity of intensity around feature points.

3. Initializing the homography by a RANSAC algorithm.

4. Refining the homography to minimize the re-projection error in all corresponding feature pairs by a Levenberg-Marquardt algorithm.

5. Using the estimated homography to find more corresponding feature pairs. Here, Steps 4 and 5 can be iterated several times to improve the homography.
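By way of illustration only, the quantity minimized in Step 4 and the inlier test commonly used inside a RANSAC loop (Step 3) can be sketched as follows. The type `Point2`, the function names and the pixel threshold are assumptions made for this example; the homography is stored as a row-major 3×3 array.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double x, y; } Point2;

/* Maps p through the homography h; returns 0 if the point maps to infinity. */
int apply_homography(const double h[9], Point2 p, Point2 *out)
{
    double w = h[6] * p.x + h[7] * p.y + h[8];
    if (w < 1e-12 && w > -1e-12)
        return 0;
    out->x = (h[0] * p.x + h[1] * p.y + h[2]) / w;
    out->y = (h[3] * p.x + h[4] * p.y + h[5]) / w;
    return 1;
}

/* Sum of squared distances between master points mapped through h and the
 * matching slave points. Pairs that map to infinity are skipped. */
double reprojection_error(const double h[9], const Point2 *master,
                          const Point2 *slave, size_t count)
{
    double sum = 0.0;
    assert(master != NULL && slave != NULL);
    for (size_t i = 0; i < count; ++i) {
        Point2 q;
        if (!apply_homography(h, master[i], &q))
            continue;
        sum += (q.x - slave[i].x) * (q.x - slave[i].x)
             + (q.y - slave[i].y) * (q.y - slave[i].y);
    }
    return sum;
}

/* Counts the pairs whose mapped point lies within max_dist pixels of its
 * match; a RANSAC loop would keep the hypothesis with the most inliers. */
size_t count_inliers(const double h[9], const Point2 *master,
                     const Point2 *slave, size_t count, double max_dist)
{
    size_t inliers = 0;
    for (size_t i = 0; i < count; ++i) {
        Point2 q;
        if (!apply_homography(h, master[i], &q))
            continue;
        double dx = q.x - slave[i].x, dy = q.y - slave[i].y;
        if (dx * dx + dy * dy <= max_dist * max_dist)
            ++inliers;
    }
    return inliers;
}
```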

Once the homographies are obtained, the camera extrinsic parameters can be linearly estimated by the following process.

2.1.3.3 Determination of Extrinsic Parameters.

For a homography H, one denotes A₂⁻¹HA₁ by M, and M's eigenvectors by $v_{j}$ (j = 1, 2, 3). According to the properties of H, one can establish three equations about n.

$\left\{ \begin{matrix} v_{1}^{T} n = 0 \\ v_{2}^{T} n = \operatorname{sign}(b_{2})\,|b_{2}|\,(a_{2} + 1) \\ v_{3}^{T} n = \operatorname{sign}(b_{3})\,|b_{3}|\,(a_{3} + 1) \end{matrix} \right.$

where $b_{j}$ and $a_{j}$ are two intermediate variables, and the values of $|b_{j}|$ and $a_{j}$ can be derived from the eigenvalues of M. This means that from one inter-image homography, one can obtain three equations of n with unknown signs. If one has m+1 images of the planar scene captured by m+1 cameras including a master camera, one can estimate m homographies from the master image to the other images. The eigenvalues and eigenvectors from each M can then be determined. Based on these, the above constraints make up a set of 3m linear equations. This presents a potential way to estimate the normal vector n. In practice, one can obtain an initial value of n by an initialization step, and then the signs in the above equations can be determined. Based on this, n can be further estimated. In one embodiment of the invention, a voting-based initialization step is adopted to determine the sign of $b_{j}$, because two possible solutions can be obtained from one homography.

More specifically, the overall procedure can be described as:

Step 1. Acquire images; detect feature points; and estimate homographies H via conventional methods or as described above.

Step 2. Calculate the eigenvalues and eigenvectors of MᵀM by a standard SVD operation.

Step 3. Estimate an initial value for the normal vector n by a voting method.

Step 4. Determine the signs in the equations, and then refine the vector n.

Step 5. Estimate the translation t (up to scale) and rotation R.

Step 6. Bundle-adjust the extrinsic camera parameters by minimizing the sum of re-projection errors of all feature correspondences.

2.2. Server Component.

The server is the most powerful unit in an interactive multi-view video system. It manages the transmission and storage of massive video data and provides services for many clients. As shown in FIG. 2, the server 216 is connected to two networks 218, 220. A network 218, such as for example a wide band network backbone, is adopted to connect the server 216 and control PCs 210 so that the compressed videos can be delivered from the control PCs 210 to the server 216. In one embodiment of the invention, the multi-view video system of the invention uses a 1 GB network to connect the server 216 and all control PCs 210. An outside network 220 (e.g., a LAN, a WAN, or even the Internet) is used to connect the server 216 with clients 222. In one embodiment of the invention, the client 222 is connected to the server 216 via a 10/100 MB or above network. In another embodiment of the invention, the client 222 is connected to the server 216 via the Internet.

2.2.1 Multi-View Video Format.

The server 216 receives the videos from control PCs 210, and then saves them into a form of multi-view video or video beam. The video beam consists of a set of video and preferably audio streams that were taken simultaneously of the same event or event space. The storage scheme of the interactive multi-view video of the invention supports massive video data and efficient search of the video beam. In one embodiment of the invention, an index structure is created to speed up the search. The multi-view video of the invention is capable of maintaining the huge video beam and supporting a vast number of users accessing the beam simultaneously. Its core technique is to use an index to facilitate the search of audio and video bit streams at any time instance. Examples of these index structures are shown in FIGS. 6A and 6B. FIG. 6A depicts the format of the video bit streams 602 and FIG. 6B depicts the format of the audio bit streams 604 that correspond with the video bit streams. The actual video and audio data, along with the index files, are often stored on the server. They can also be stored locally at the client for off-line playing. For example, the video beam can be stored on a DVD disc and be played by any PC at the client.

Since the size of multi-view video might be very large, a 64-bit pointer is used to represent the starting point of any compressed multi-view video frame in one embodiment of the invention. On the other hand, a 32-bit pointer is sufficient to represent the starting point of any compressed audio frame. Moreover, to reduce the time consumption of locating the video bit stream as well as to reduce the size of the video index file, the 64-bit pointer is split into a 32-bit high-address pointer and a 32-bit low-address pointer. A flag (e.g., named ‘bCross4G’) is used to signal whether there is a transition in the high-address pointer or not. If the flag is set to ‘true’, then the low addresses should be checked. In that case, if the value of the current low address is smaller than that of the previous low address, the high address should be increased by 1 for the remaining pointers starting from the current one.

The indexes of audio and video are saved to separate files. The video index file is organized by a layered structure. The first layer is composed of many fields 606 (e.g., ‘VideoIndexInfoHeader’ fields), each of them containing a time stamp, an offset of the video index data, a 32-bit high address, a flag indicating whether there is a transition to a high address pointer or not (e.g., a ‘bCross4G’ flag), and the number of cameras employed at that time instant. The second layer contains the detailed video index data 610 (e.g., ‘VideoIndex’ fields) with the same time stamp pointed to by the first layer 608, as shown in FIG. 6A. Each field of the second layer consists of a camera ID, a coding type of that frame, and a 32-bit low-address pointer. Notice that the number of ‘VideoIndex’ fields for a certain time stamp equals the total number of cameras represented by the ‘byCameraNum’ in the ‘VideoIndexInfoHeader’ field. Also note that the number of cameras at different time stamps could be different.

An example of the structure of the video index is shown below.

    // first layer
    struct VideoIndexInfoHeader {
        DWORD dwTimeStamp;   // time stamp of multi-view video
        DWORD dwOffset;      // 32-bit offset of the video index data (VideoIndex fields)
        DWORD dwOffsetHigh;  // the high address of the offset
        BOOL  bCross4G;      // indicates whether the offsets have the same dwOffsetHigh or not
        BYTE  byCameraNum;   // total number of cameras at that time stamp
    };

    // second layer
    struct VideoIndex {
        BYTE  byCameraID;    // the ID of the camera, maximum 255
        BYTE  byFrameType;   // coding type of the video frame
        DWORD dwOffsetLow;   // the low-address pointer
    };
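By way of illustration only, the following minimal sketch shows how a reader of the index might rebuild the 64-bit starting address of each frame at one time stamp from the high address, the ‘bCross4G’ flag and the 32-bit low-address pointers. The fixed-width typedefs stand in for DWORD, BOOL and BYTE, and the function name `resolve_frame_offsets` is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-ins for the DWORD/BOOL/BYTE fields of the index layers. */
typedef struct {
    uint32_t dwTimeStamp;
    uint32_t dwOffset;
    uint32_t dwOffsetHigh;
    int      bCross4G;
    uint8_t  byCameraNum;
} VideoIndexInfoHeader;

typedef struct {
    uint8_t  byCameraID;
    uint8_t  byFrameType;
    uint32_t dwOffsetLow;
} VideoIndex;

/* Fills out[i] with the 64-bit address of the i-th frame of this time stamp.
 * When bCross4G is set and a low address is smaller than the previous one,
 * the high address is increased by 1 from that entry onward. */
void resolve_frame_offsets(const VideoIndexInfoHeader *hdr,
                           const VideoIndex *entries, uint64_t *out)
{
    uint32_t high;
    assert(hdr != NULL && entries != NULL && out != NULL);
    high = hdr->dwOffsetHigh;
    for (size_t i = 0; i < hdr->byCameraNum; ++i) {
        if (hdr->bCross4G && i > 0 &&
            entries[i].dwOffsetLow < entries[i - 1].dwOffsetLow)
            ++high;
        out[i] = ((uint64_t)high << 32) | entries[i].dwOffsetLow;
    }
}
```

Storing one shared high address per time stamp keeps each second-layer entry at 32 bits, which is the size reduction the index layout aims for.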

The audio index file 604 is organized by a layered structure as well. The first layer is composed of many fields 614 (e.g., ‘AudioIndexInfoHeader’ fields), each of them containing a time stamp, an offset of the audio index data, and the number of audio records at that time instant. The second layer 616 contains the detailed audio index data (e.g., ‘AudioIndex’ fields) with the same time stamp, as shown in FIG. 6B. Notice that the number of ‘AudioIndex’ fields for a certain time stamp equals the total number of audio streams represented by ‘byAudioNum’ in the ‘AudioIndexInfoHeader’ field. Also note that the number of audio streams at different time stamps could be different.

An example of the structure of the audio index is shown below.

    // first layer
    struct AudioIndexInfoHeader {
        DWORD dwTimeStamp;   // time stamp of multi-view video
        DWORD dwOffset;      // 32-bit offset of the audio index data (AudioIndex fields)
        BYTE  byAudioNum;    // total number of audio streams at that time stamp
    };

    // second layer
    struct AudioIndex {
        BYTE  byAudioID;     // the ID of the audio stream, maximum 255
        DWORD dwOffset;      // the 32-bit pointer
    };

2.3 Client Component.

The received video beam can be either used directly for on-line interactive service or saved to disk for off-line processing. In the context of one embodiment of the system and method according to the invention, on-line means the watched video beam is captured in real time. Off-line means the video beam has been captured and stored at a storage medium. There are two types of off-line playing. One is that the video beam is stored at the server and the client plays it by a streaming process, as is done in video on demand (VOD) for example. In this mode, the server acts as a streaming server. Thus, this type of off-line playing is referred to as “Streaming Services”. The other type of off-line playback occurs when the video beam is stored at a local disk or another place. In this mode the client can play it without the help of the server.

For on-line interactive service, the server responds to user commands from clients. The commands supported in one exemplary embodiment of the invention include: switch, sweeping, freeze and rotate, and history review in addition to conventional commands in a typical media player such as a VCR. According to the user commands, the server generates a video stream from the captured videos and then sends it to the client. In one embodiment of the invention, there are two communication channels for one client. One is a User Datagram Protocol (UDP) channel which is used to transmit audio/video data to reduce latency, and the other is a Transmission Control Protocol (TCP) channel used to transmit command and control data for controlling the capturing cameras to ensure correctness. For off-line processing, the video beam is transcoded to further reduce the data amount. The detailed off-line compression procedure will be presented in Section 3.2. The details of the client component are discussed below.

2.3.1 On-line Services.

In on-line services, clients can remotely connect to the server in a LAN, a WAN, and even the Internet. Once the connection between the client and the server is established, the user can subscribe at the client to the conventional commands of a typical media player and also to the ability to issue the unique commands of interactive multi-view video (such as, for example, switching, sweeping, freeze and rotate, and history review).

The clients send their commands to the server. In response, the server generates and transmits the expected video to each client according to that user's commands. In short, users can play the multi-view video interactively. In some cases, the users can also input parameters such as camera ID and pan-tilt values to the client. The client can transmit these parameters to the server and then to the control PCs to control the capturing cameras.

2.3.2 Off-line Services.

In off-line playing, the client can directly open a multi-view video beam which is stored at a local disk or another place and play it. In addition to conventional effects such as those in a typical video player (for example, play, fast forward, rewind, pause, stop and so forth), users can experience special effects including switching between different video streams, a sweeping effect and a freeze-and-rotate effect, for example. A brief description of these special effects is provided below.

In streaming mode, the client can remotely connect to the server via a LAN, a WAN, and even the Internet as in the on-line mode. In this mode the server component acts as a streaming server managing the clients' connections and video beams, and users can submit their commands to the server to select their desired contents from video beams, and to view different video effects (for example, switching, sweeping, freeze and rotate, history review and script). This mode is an extension of current Video on Demand (VoD) systems. The main difference between streaming services and on-line service is that in the streaming mode, the video beams have been captured and stored at the server component, and are not captured in real time. The streaming services support all the user commands listed below.

Switching Effect: The switching effect involves the user being able to switch between one camera viewpoint and another as the video continues in time. This involves accessing the video streams from different cameras that provide the desired point of view. One example is that a user switches from the viewpoint of the second camera in a sequence to the viewpoint of the fifth camera.

Sweeping Effect: The sweeping effect involves sweeping through adjacent camera views while time is still moving. It allows the user to view the event from different viewpoints. One example is that, assuming there are eight viewpoints in total, a user starts from the first viewpoint, and switches continuously to the second viewpoint, the third viewpoint and so on until the eighth viewpoint, and then continues watching from the eighth viewpoint.

Freeze and Rotate Effect: In the freeze and rotate effect, time is frozen and the camera viewpoint rotates about a given point. One example is that, assuming there are eight viewpoints in total, a user starts from the first viewpoint, and switches continuously to the second, the third, and so on until the eighth viewpoint, back and forth.

History Effect: In the history effect the user can play back the previously viewed or created video sequence.

Script: The user can also create a script of a set of views and special effects that can be played on demand. He or she can also send this script to other users who will, when the script is activated, observe the same scripted video events.

The Sweeping, Switching, and Freeze and Rotate effects can also be available in the on-line mode.
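By way of illustration only, the three viewpoint effects can be expressed as a rule that chooses which camera and which frame to show at each playback step. The sketch below assumes zero-based camera numbering and one camera change per step; the names `Effect`, `ViewSample` and `select_view` are hypothetical.

```c
#include <assert.h>

typedef enum { EFFECT_SWITCH, EFFECT_SWEEP, EFFECT_FREEZE_ROTATE } Effect;

typedef struct {
    int camera;  /* zero-based index of the camera to show */
    int frame;   /* index of the frame (time instant) to show */
} ViewSample;

/* Picks the camera and frame shown `step` playback steps after an effect
 * started at (start_camera, start_frame). target_camera is used only by
 * the switching effect. */
ViewSample select_view(Effect effect, int start_camera, int target_camera,
                       int start_frame, int step, int num_cameras)
{
    ViewSample s;
    assert(num_cameras > 0 && step >= 0);
    assert(start_camera >= 0 && start_camera < num_cameras);
    s.camera = start_camera;
    s.frame = start_frame + step;            /* time keeps moving by default */
    switch (effect) {
    case EFFECT_SWITCH:                      /* jump to the target view */
        if (step > 0)
            s.camera = target_camera;
        break;
    case EFFECT_SWEEP: {                     /* advance one view per step, then stay */
        int cam = start_camera + step;
        s.camera = cam < num_cameras ? cam : num_cameras - 1;
        break;
    }
    case EFFECT_FREEZE_ROTATE:               /* time frozen, views go back and forth */
        s.frame = start_frame;
        if (num_cameras > 1) {
            int period = 2 * (num_cameras - 1);
            int pos = (start_camera + step) % period;
            s.camera = pos < num_cameras ? pos : period - pos;
        }
        break;
    }
    return s;
}
```

With eight cameras, a sweep started at the first view reaches the eighth view after seven steps and remains there, whereas freeze and rotate turns back toward the first view while the frame index stays fixed.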

3.0 Compression Procedures.

Both on-line and off-line compression procedures can be used with the interactive multi-view video system and method of the invention. The on-line compression procedure is designed for real-time multi-view video capturing. Its outputs may be either used directly for on-line service, or saved to disk for future processing (for example, further off-line compression or future playback). The off-line compression procedure is adopted in the transcoding process to compress the pre-encoded bit stream much more efficiently. After that, the output bit streams are saved on disk for storage and off-line services.

It should be noted that although specific novel on-line and off-line compression procedures are described in the sections below, the system and method of the invention are not limited to these types of compression. Conventional compression algorithms could also be used.

3.1 On-line Compression.

In general, similar to conventional single-view video coding, in the on-line compression used in one embodiment of the interactive multi-view video system of the invention, each view of video can be coded in a format of IPPP frames.

By way of background, typical video compression utilizes two basic compression techniques: inter-frame (P-frame) compression and intra-frame (I-frame) compression. Inter-frame compression is between frames and is designed to minimize data redundancy in successive pictures (i.e., temporal redundancy). Intra-frame compression occurs within individual frames and is designed to minimize the duplication of data in each picture (i.e., spatial redundancy). In conventional video coding, intra-picture frames essentially encode the source image in the JPEG format (with some differences). Typically blocks of pixels are run through a Discrete Cosine Transform (DCT) and are quantized on a per-macroblock basis. Intra-picture frames are not dependent on any other frames and are used as ‘jump-in’ points for random access. Inter-frames, sometimes called predicted frames (P-frames), make use of the previous I or P frame to ‘predict’ the contents of the current frame and then compress the difference between the prediction and the actual frame contents. The prediction is made by attempting to find an area close to the current macroblock's position in the previous frame, which contains similar pixels. A motion vector is calculated which moves the previous predicted region (typically with half pixel accuracy) to the current macroblock. The motion vector may legitimately be a null vector if there is no motion, which of course encodes very efficiently. The difference between the predicted pixels and their actual values is calculated, DCT-transformed and the coefficients quantized (more coarsely than I frame DCT coefficients). If a sufficiently similar group of pixels cannot be found in the previous frame, a P frame can simply spatially encode the macroblock as though it were an I-frame.

Like conventional video coding, there are two types of frames in the on-line compression algorithm of the invention: ‘I’ frames and ‘P’ frames. The compression of each ‘I’ frame is based only on the correlations within that frame, while the compression of a ‘P’ frame is based on the correlations of that frame and its previous frame. Basically speaking, the compression efficiency of the ‘P’ frame is much higher than that of the ‘I’ frame. Although the ‘I’ frame cannot give efficient compression, it is very robust to errors. Moreover, since each ‘I’ frame does not depend on other frames, it can be easily accessed. This is why a typical video encoder will compress frames as ‘I’ frames periodically.

A big difference between the conventional schemes and the on-line compression of the interactive multi-view video system of the invention, however, lies in a unique “STATIC” mode that is introduced to speed up the predictive coding. To find the STATIC mode, it is necessary to calculate the difference between the original image and a reference image. To further reduce the computing complexity, the decision of whether to use this STATIC mode or not is determined jointly among all views. In this joint decision, the static regions of a certain view are first detected. Then their corresponding regions overlapped by the neighboring views are considered to be likely STATIC. Finally, a very simple check is applied to confirm the decision (in one embodiment of the invention, only a small portion of pixels is used to calculate the difference between the original image and the reference image). In the STATIC mode, the involved macroblock (MB) will be coded like the traditional INTER mode, while its corresponding reference image, which will be used by the next frame for temporal prediction, is simply copied from its previous reconstructed image. As a result, none of de-quantization, inverse DCT and motion compensation is required for creating the reference image of this MB.
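By way of illustration only, the confirming check on a small portion of pixels might be sketched as a subsampled sum of absolute differences over one macroblock. The sampling step, the threshold and the function names are assumptions made for this example; frames are taken to be 8-bit luminance arrays in row-major order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE 16

/* Sum of absolute differences over every `step`-th pixel, horizontally and
 * vertically, of the macroblock at block coordinates (mb_x, mb_y). */
unsigned subsampled_sad(const uint8_t *cur, const uint8_t *ref, int width,
                        int mb_x, int mb_y, int step)
{
    unsigned sad = 0;
    assert(step > 0);
    for (int y = 0; y < MB_SIZE; y += step)
        for (int x = 0; x < MB_SIZE; x += step) {
            int idx = (mb_y * MB_SIZE + y) * width + mb_x * MB_SIZE + x;
            sad += (unsigned)abs((int)cur[idx] - (int)ref[idx]);
        }
    return sad;
}

/* Confirms a STATIC candidate when the subsampled difference is small.
 * The threshold is a tuning parameter assumed for this sketch. */
int confirm_static(const uint8_t *cur, const uint8_t *ref, int width,
                   int mb_x, int mb_y, int step, unsigned threshold)
{
    return subsampled_sad(cur, ref, width, mb_x, mb_y, step) <= threshold;
}
```

With a step of 4, only 16 of the 256 pixels of a macroblock are visited, which is where the saving in computation comes from.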

In addition to the new coding mode, joint motion estimation (ME) is also applied to reduce the complexity of ME. In this new ME, traditional ME is first applied for a certain view. A 3D motion vector (MV) is then created based on the found MV of that view. After that, the 3D MV is projected to the neighboring views to predict their own MVs. Based on the predicted MV, the search range of these views can be reduced and thus complexity can be significantly reduced. For example, in conventional single-view video coding, an encoder typically has to search within a 32×32 region in order to find the motion vector of a certain macroblock. But in the multi-view video coding of the system and method according to the invention, once the 3D motion is obtained and projected to a certain view, the search range of that view can be narrowed down (say, for example, to 8×8 pixels), thus the computation of finding the motion vector of that view is significantly reduced. On the other hand, this also implies that the motion vectors of different views are correlated. Hence, these motion vectors can be further compressed. In one embodiment of this invention, only the difference between the true motion vector $V$ and the predicted vector $\hat{V}$ obtained from other views is encoded.
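By way of illustration only, the reduced search around a predicted motion vector and the coding of the difference between $V$ and $\hat{V}$ might be sketched as follows. The projection of the 3D MV into the view is assumed to have been done already and is passed in as `predicted`; the full-pixel block matching and all names are assumptions made for this example.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

#define MB_SIZE 16

typedef struct { int dx, dy; } MotionVector;

/* Sum of absolute differences between the current block at (cx, cy) and
 * the reference block at (rx, ry), both given in pixel coordinates. */
static unsigned block_sad(const uint8_t *cur, const uint8_t *ref, int width,
                          int cx, int cy, int rx, int ry)
{
    unsigned sad = 0;
    for (int y = 0; y < MB_SIZE; ++y)
        for (int x = 0; x < MB_SIZE; ++x)
            sad += (unsigned)abs((int)cur[(cy + y) * width + cx + x] -
                                 (int)ref[(ry + y) * width + rx + x]);
    return sad;
}

/* Searches only within +/-range pixels of the predicted vector instead of a
 * full window. Returns the prediction itself if no candidate fits the frame. */
MotionVector search_around(const uint8_t *cur, const uint8_t *ref,
                           int width, int height, int cx, int cy,
                           MotionVector predicted, int range)
{
    MotionVector best = predicted;
    unsigned best_sad = UINT_MAX;
    assert(range >= 0);
    for (int dy = -range; dy <= range; ++dy)
        for (int dx = -range; dx <= range; ++dx) {
            int rx = cx + predicted.dx + dx;
            int ry = cy + predicted.dy + dy;
            if (rx < 0 || ry < 0 || rx + MB_SIZE > width || ry + MB_SIZE > height)
                continue;
            unsigned sad = block_sad(cur, ref, width, cx, cy, rx, ry);
            if (sad < best_sad) {
                best_sad = sad;
                best.dx = predicted.dx + dx;
                best.dy = predicted.dy + dy;
            }
        }
    return best;
}

/* Only this difference between the true and predicted vectors is encoded. */
MotionVector mv_residual(MotionVector actual, MotionVector predicted)
{
    MotionVector r;
    r.dx = actual.dx - predicted.dx;
    r.dy = actual.dy - predicted.dy;
    return r;
}
```

A range of 2 visits 25 candidate positions, compared with roughly a thousand for a full 32×32 window, and the residual vector is typically small and therefore cheap to encode.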

A general exemplary flow chart of the on-line encoding scheme of the invention for one camera is shown in FIG. 7. In this example, it is assumed that the system has three video cameras each capturing video at 30 frames per second. The frame size is 640×480 pixels. Hence, one needs to compress 3×30 frames per second. The compression of frames captured by a single camera is considered first, then the case of multiple videos is discussed.

As shown in FIG. 7, process action 702, when encoding a frame, one first partitions the frame into blocks, preferably macroblocks (MBs), no matter what type of frame it is. The size of an MB is 16×16 pixels; that is, in the above example, one gets 640×480/16/16 = 1200 MBs per frame. Each frame is then compressed according to the pre-determined coding type. For each ‘I’ frame, all MBs are coded with intra-mode (process actions 704, 708); whereas for the ‘P’ frame, there are three coding modes that can be chosen when encoding each MB. The mode decision is MB-based. In other words, different MBs in a ‘P’ frame could have different coding modes. In order to determine which mode to use, the encoder first performs a motion estimation operation for each MB to calculate the similarity of the current frame and its previous frame (process action 710). If the difference is very large, which indicates there is almost no correlation for that MB, the intra-mode will be chosen (process actions 712 and 714). If the difference is very small, the ‘STATIC’ mode will be chosen (process actions 716, 718). As for the remaining case, the ‘INTER’ mode will be chosen (process action 720). This is the mode decision for the input from one video stream only.
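By way of illustration only, the per-MB mode decision described above can be sketched with two thresholds on the motion-compensated difference. The threshold values are tuning parameters that the description does not specify, and the names `CodingMode` and `choose_mode` are hypothetical.

```c
#include <assert.h>

typedef enum { MODE_INTRA, MODE_INTER, MODE_STATIC } CodingMode;

/* Number of 16x16 macroblocks in a frame whose sides are multiples of 16. */
int macroblocks_per_frame(int width, int height)
{
    return (width / 16) * (height / 16);
}

/* Chooses the coding mode of one MB from the difference between the MB and
 * its best match in the previous frame. */
CodingMode choose_mode(int is_p_frame, unsigned difference,
                       unsigned static_threshold, unsigned intra_threshold)
{
    assert(static_threshold <= intra_threshold);
    if (!is_p_frame)
        return MODE_INTRA;            /* every MB of an 'I' frame is intra coded */
    if (difference > intra_threshold)
        return MODE_INTRA;            /* almost no correlation with the previous frame */
    if (difference < static_threshold)
        return MODE_STATIC;           /* very small difference */
    return MODE_INTER;                /* the remaining case */
}
```

Raising `static_threshold` moves more MBs from the INTER mode to the STATIC mode, trading some coding performance for lower complexity.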

Below is the description of the three encoding modes for the on-line compression. FIGS. 11A, 11B and 11C show the encoding architecture for the above described modes (inter-mode, intra-mode and static mode, respectively).

-   1) Intra-mode: As shown in FIG. 8, the coefficients in each MB are first transformed by a transformation or ‘T’ module to remove their spatial correlations (process action 802). After that, the transformed coefficients are quantized by a ‘Q’ module (process action 804). (A simple example of the quantization process is as follows: assume that one has two coefficients 67 and 16, and the quantization level is 64. After the quantization, the first coefficient becomes 64, while the second coefficient becomes 0. One can see that the purpose of quantization is to remove the uncertainty of the coefficients so that they can be coded easily. Of course, some of the information will be lost after the quantization.) The quantized coefficients are encoded (e.g., by using an ‘Entropy Coding’ module) (process action 806). Finally, one obtains the compressed bit stream (process action 808).

-   2) Inter-mode: As shown in FIG. 9, the current MB and previous reference frame are first input (process action 902). A ‘motion estimation’ process is then performed on the previous reference frame, which is saved in the ‘Frame Buffer’, to find the regions most similar to the current MB (process action 904). (Note that the motion estimation process is typically performed on the current MB by the mode decision process as shown in FIG. 7, so it is not necessary to do it again here.) After that, as shown in process action 906, a Motion Compensation operation is applied to copy the found regions from the ‘Frame Buffer’ by a motion compensation (MC) module. Now one has two MBs: one is from the original frame and the other is from the ‘MC’ module. These two MBs are similar; however, there is still some difference between them. Their difference, called the residue, is then transformed by the ‘T’ module and quantized by the ‘Q’ module (process actions 908 and 910). Finally, the quantized results are coded by an ‘Entropy Coding’ module (process action 912). It is also necessary to update the reference image for the next frame. This is achieved by an inverse quantization module (‘Q-1’) and an inverse transform module (‘T-1’) (as shown in process actions 914 and 916), and then adding the recovered residue as a result of these actions onto the motion compensated results (process action 918). After that, the encoder has the same reference image as that in the decoder.

-   3) Static mode: The static mode is the new mode employed by the system and method of the invention. Its first part is very similar to that of the inter-mode. However, there is a big difference in the second part, i.e., creating the reference frame. In this new mode, the new reference is just copied from the previous one; whereas in the previous INTER mode, inverse quantization, inverse transform and residue adding are required. As a result, a vast amount of computation can be saved. A flow diagram of static mode processing is shown in FIG. 10. As shown in FIG. 10, the current MB and previous reference frame are first input (process action 1002). A ‘motion estimation’ process is then performed on the previous reference frame, which is saved in the ‘Frame Buffer’, to find the regions most similar to the current MB (process action 1004). (Note that the motion estimation process is typically performed by the mode decision process as shown in FIG. 7, so it is not necessary to do it again here.) After that, as shown in process action 1006, a ‘MC’ module (i.e., Motion Compensation) is applied to copy the found regions from the ‘Frame Buffer’. Then, one has two MBs: one is from the original frame and the other is the result from the ‘MC’ module. The difference between these two MBs is then transformed by the ‘T’ module and quantized by the ‘Q’ module (process actions 1008 and 1010). Finally, the quantized results are coded by the ‘Entropy Coding’ module (process action 1012). As for the new reference frame, it is simply obtained by copying the motion compensated MB (process action 1014). It is important to point out that, in this STATIC mode, the MB need not be truly static; it could contain motion. Moreover, when the mode decision threshold determining whether to code the MB in INTER mode or STATIC mode becomes very large, most INTER mode MBs will be coded in STATIC mode. In that case, the complexity can be reduced significantly, while the performance will be sacrificed a bit. In one embodiment of the invention, the above mode decision threshold is controlled to achieve an appropriate tradeoff between the complexity and performance.
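By way of illustration only, the quantization example and the difference between the INTER and STATIC reference updates can be sketched as follows. The sketch assumes a uniform quantizer that truncates toward zero, which reproduces the values in the example above, and it omits the transform stage for brevity; all function names are hypothetical.

```c
#include <assert.h>

/* Uniform quantizer: maps a coefficient to an integer level index. */
int quantize(int coeff, int step)
{
    assert(step > 0);
    return coeff / step;              /* C integer division truncates toward zero */
}

/* Inverse quantization: recovers the representative value of a level. */
int dequantize(int level, int step)
{
    return level * step;
}

static int clamp_pixel(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* INTER mode: the next reference sample is the motion compensated sample
 * plus the residue as the decoder will recover it. */
int next_reference_inter(int predicted, int residue, int step)
{
    return clamp_pixel(predicted + dequantize(quantize(residue, step), step));
}

/* STATIC mode: the next reference sample is copied from the motion
 * compensated sample, so no inverse quantization or transform is needed. */
int next_reference_static(int predicted)
{
    return predicted;
}
```

With a quantization level of 64, the coefficient 67 is reconstructed as 64 and the coefficient 16 as 0, matching the example given for the intra-mode.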

The decoding process is just the inverse of the encoding process. For example, the compressed bit stream is first put into an entropy decoder to obtain the quantized coefficients (as well as other necessary information such as the coding mode of each MB). For each MB, according to its coding mode, the quantized coefficients are then de-quantized, inverse transformed, and so on.

The mode decision for multiple cameras proceeds as follows. Referring back to the three-camera case and to FIGS. 12A and 12B, video from the first camera undergoes the mode decision exactly as presented before (process actions 1202-1222). After that, one tries to establish the correspondence between the first camera and the remaining two cameras (process action 1224) using epipolar geometry and similarity of the image regions. Based on the correspondence, the coding modes of the second and the third cameras are estimated (process action 1226). Since the estimation is not always correct, these found coding modes and even the motion vectors need to be refined, which is achieved by a second mode decision process (process action 1228) with less computing cost. Each MB is then coded based on the found coding mode (process action 1230). Similar to the mode decision for a single view, this second decision process also calculates the difference between the original MB and the motion compensated MB. However, only the difference of a small portion of the pixels is calculated. As a result, much of the complexity is reduced.

In the multi-view case, each view is decoded independently, the same as in the single-view case. If an MV is predicted from the neighboring view, the MV of the neighboring view should be decoded first.

3.2 Off-line Compression

Off-line compression can be used to compress or further compress the video data streams. As shown in FIGS. 13 and 14, a key idea of off-line compression is to decompose all views into a 3D mapping, which consists of a group of feature points in the 3D environment. As shown in FIG. 13, process action 1302, each feature point is represented by its 3D coordinates (x, y, z) and the corresponding color components (Y, U, V). The created mapping is the minimum set of feature points that can reconstruct all of the pixels in each view. Different from the transform-based decomposition such as DCT and DWT, this kind of decomposition is the most efficient one for decorrelating a multi-view video. Clearly, when the number of views increases, only those new feature points (i.e., the new information) need to be recorded, whereas others can be found from the existing mapping.

After the 3D mapping creation, as shown in process action 1304, the obtained feature points are transformed to further decompose the correlations among them. The transformed results are quantized and encoded as a ‘base layer’ bit stream (process actions 1306, 1308). The dequantized feature points are mapped back onto each view to form a predicted view image (process action 1310). The predicted image is close to the original one; however, there are still some differences between them. The difference is encoded independently as an ‘enhancement layer’ of each view image as shown in process actions 1312, 1314 (the enhancement layer bit stream may be encoded in a scalable fashion to improve the network adaptation capability). Moreover, the temporal correlations are further employed when encoding the two kinds of layers. This is because, in the time domain, the static part of the mapping information and the enhancement residue are invariant. As for the moving part, it could still be compressed by the 3D motion structure.

An exemplary coding architecture for the off-line compression is depicted in FIG. 14. It includes a 3D mapping creation module 1402, transformation modules 1404, quantization modules 1406, inverse transformation modules 1408, inverse quantization modules 1410, inverse mapping modules 1412 and entropy encoding modules 1414, as well as view buffers 1416. To simplify the representations, only two views are considered in this example. For views captured at the i-th time, all view images and the cameras' positions are put into a ‘3D mapping creation’ module to extract the feature point set $M_{i}$. The mapping information $M_{i}$ is then predicted from the previous reconstructed feature point set $\hat{M}_{i-1}$ to remove its temporal correlations. The predicted residues $M_{i} - \hat{M}_{i-1}$ are transformed and quantized (either DCT or Discrete Wavelet Transform (DWT) or other transformation can be adopted here). Finally, entropy coding is applied to generate the base layer bit stream. The reconstructed mapping information $\hat{M}_{i}$ is then put into an ‘Inverse Mapping’ module, along with the cameras' positions. After that, a predicted image for each view is obtained. The difference between the predicted image and the original one is further decorrelated by the temporal prediction. The residue is transformed and quantized (either DCT or DWT or other transformation can be adopted here). Finally, entropy coding is applied to generate the enhancement layer bit streams. (In this example, two enhancement layer bit streams are yielded, one bit stream for each view.)

The decoding process is as follows. Assume that one wants to reconstruct a certain view. The base layer is first decoded through entropy decoding, de-quantization, inverse transform, and so on (i.e., the inverse of the coding process of that layer). After that, the enhancement layer of that view is decoded through entropy decoding, de-quantization, inverse transform, and so on. Finally, the obtained common feature points (from the base layer) are inverse mapped to that view. The resulting image plus the decoded enhancement layer forms the reconstructed image of that view.
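By way of illustration only, the last two stages of this decoding process can be sketched as simple additions: the temporally predicted mapping plus its decoded residue, and the inverse-mapped prediction plus the decoded enhancement layer. Entropy decoding, de-quantization, the inverse transform and the inverse mapping itself are assumed to have been performed already; the array layouts and function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base layer: rebuilds the current mapping from the previous reconstructed
 * mapping and the decoded prediction residue. */
void reconstruct_mapping(const double *previous, const double *residue,
                         double *current, size_t count)
{
    assert(previous != NULL && residue != NULL && current != NULL);
    for (size_t i = 0; i < count; ++i)
        current[i] = previous[i] + residue[i];
}

/* Enhancement layer: adds the decoded residue of one view to the image
 * predicted from the base layer, clamping to the 8-bit sample range. */
void reconstruct_view(const uint8_t *predicted, const int16_t *enhancement,
                      uint8_t *out, size_t count)
{
    assert(predicted != NULL && enhancement != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        int v = (int)predicted[i] + (int)enhancement[i];
        out[i] = (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
    }
}
```

Because the base layer is shared by all views, only the enhancement layer of the requested view needs to be decoded in addition to it.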

The foregoing description of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the invention be limited not by this detailed description, but rather by the claims appended hereto.

REFERENCES

-   1. http://www.ri.cmu.edu/projects/project 449.html
-   2. Z. Zhang, “A flexible new technique for camera calibration”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330-1334, 2000.
-   3. S. Wurmlin, E. Lamboray, O. G. Staadt, and M. H. Gross, “3D Video Recorder”, Proc. of Pacific Graphics '02, pp. 325-334, Oct. 9-11, 2002.
-   4. ISO/IEC JTC1/SC29/WG11 N5877, “Applications and requirements for 3DAV”, July 2003.
-   5. ISO/IEC JTC1/SC29/WG11 N5878, “Report on 3DAV exploration”, July 2003.

1. A system for capturing video of an event, comprising:
a set of video cameras simultaneously capturing streams of images of the same event from different viewpoints and angles, wherein one camera of the set of video cameras is a master camera and the other cameras of the set are slaves to the master camera, wherein all cameras can be driven to point to the same point viewed by the master camera and synchronized by a synchronization unit that causes all cameras to trigger and shoot at the same instance in time, in order to simultaneously capture the same event from different viewpoints;
control computers for controlling said set of video cameras and receiving and compressing the simultaneously captured streams of images from said set of video cameras;
a connection connecting said set of video cameras to one or more control computers for sending control commands to said set of video cameras and for transferring said simultaneously captured streams of images from said set of video cameras to said one or more control computers, the one or more control computers compressing and transferring the streams of images;
a server for storing the simultaneously captured synchronized streams of images from the one or more control computers in the form of a video beam, consisting of the synchronized streams of images of the same event taken from different viewpoints and angles at the same time, indexed in a layered video index so that multiple clients can access one or combinations of the simultaneously captured streams of images, wherein the layered video index comprises a first layer comprising a time stamp, an offset of video index data, a high address, a flag indicating whether there is a transition to a high address pointer and a number of cameras employed, and a second layer comprising the same time stamp as the first layer, camera ID, a frame coding type and a low address pointer, further compressing the streams of compressed images by exploiting the spatial and temporal characteristics among them, and providing said simultaneously captured streams of images to one or more clients;
a first network connecting said server and said one or more control computers for transferring said simultaneously captured compressed streams of images from said control computers to said server in real time; and
a second network for communicating between said server and one or more clients.
2. The system of claim 1 wherein the streams of images are associated with corresponding audio streams.
3. The system of claim 1 wherein at least one camera further comprises a zoom lens to vary the camera's field of view.
4. The system of claim 1 wherein at least one camera further comprises a pan tilt head to vary the viewpoint of said camera.
5. The system of claim 1 wherein the master camera is controlled by a camera man.
6. The system of claim 5 wherein the master camera is controlled by an object tracking procedure.
7. The system of claim 1 wherein the cameras in the set of cameras are calibrated using a pattern comprising reference points at known locations.
8. The system of claim 7 wherein the pattern is color-coded.
9. The system of claim 7 wherein the calibration of the set of cameras comprises:
placing a pattern on a common plane with corner and reference points at known locations;
capturing at least one image of the pattern with each of said set of cameras;
locating feature points in each image of the calibration pattern;
finding correspondences between the feature points in the images and the known locations of the reference points in the pattern; and
using these correspondences to find the extrinsic parameters of each of the cameras.
10. The system of claim 9 wherein the common plane is a ground plane.
11. The system of claim 1 wherein the streams of images are compressed in real-time.
12. The system of claim 1 wherein the streams of images are recorded.
13. The system of claim 12 wherein the streams of images are compressed prior to being recorded.
14. The system of claim 1 wherein a user can control at least one camera of the set of cameras by sending a command from the client to the one or more control computers via the server.
15. A computer-implemented process for providing a multi-view video comprising the process actions of:
simultaneously capturing synchronized streams of images of an event space with a set of cameras viewing the event space from different viewpoints and angles;
receiving said synchronized, simultaneously captured streams of images from said set of cameras at at least one control computer;
compressing said simultaneously captured streams of images in at least one control computer;
transmitting said simultaneously captured streams of compressed images from the at least one control computer to a server in real time;
further compressing said streams of compressed images by exploiting the spatial and temporal correlations among them; and
simultaneously providing combinations of said simultaneously captured synchronized further compressed streams of compressed images, in the form of a video beam, consisting of the synchronized streams of images of the same event space taken from different viewpoints and angles at the same time, indexed using a video index with a layered structure, so that multiple clients can access one or combinations of the simultaneously further compressed captured streams of images from the server to at least one client, said video index comprising a first layer comprising a time stamp, an offset of video index data, a high address, a flag indicating whether there is a transition to a high address pointer and a number of cameras employed at a given time, and a second layer comprising detailed video index data.
16. The computer-implemented process of claim 15 wherein a first network is used to transfer the streams of video from the at least one control computer to the server and a second network is used to transfer the streams of video from the server to the at least one client.
17. The computer-implemented process of claim 16 wherein the second network is the Internet.
18. The computer-implemented process of claim 15 wherein the at least one client can be used to request which streams of images should be sent to the client.
19. The computer-implemented process of claim 15 wherein the at least one client can be used to control at least one of the set of cameras.